Finding Past Anomalies
Learn running simulations on past data to find anomalies.
In the previous section, we identified an anomaly. We found an increase in the 400
status code because the z-score was 6. But how do we set the threshold for the z-score? Is a z-score of 3 an anomaly? What about 2, or 1?
To find thresholds that fit our needs, we can run simulations on past data with different values and evaluate the results. This is often called backtesting.
The first thing we need to do is to calculate the mean and the standard deviation for each status code up until every row, just as if it is the current value. This is a classic job for a window function:
We use a window function to calculate the mean and standard deviation over a sliding window of 60 minutes. To avoid repeating the WINDOW clause for every aggregate, we define a named window called “status_window.” This is another nice feature of PostgreSQL.
In the results, we can now see that we have the mean and standard deviation of the previous 60 rows for every entry. This is similar to the calculation we did in the previous section except this time we’ll do it for every row.
Now we can calculate the z-score for every row:
We now have z-scores for every row, and we can try to identify anomalies:
We decided to classify values with a z-score greater than 3 as anomalies.You’ll often find that 3 is the magic number in textbooks, but do not get sentimental about it, because you can definitely change it to get better results.